110 research outputs found

    Optimizing substitution matrix choice and gap parameters for sequence alignment

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>While substitution matrices can readily be computed from reference alignments, it is challenging to compute optimal or approximately optimal gap penalties. It is also not well understood which substitution matrices are the most effective when alignment accuracy is the goal rather than homolog recognition. Here a new parameter optimization procedure, POP, is described and applied to the problems of optimizing gap penalties and selecting substitution matrices for pair-wise global protein alignments.</p> <p>Results</p> <p>POP is compared to a recent method due to Kim and Kececioglu and found to achieve from 0.2% to 1.3% higher accuracies on pair-wise benchmarks extracted from BALIBASE. The VTML matrix series is shown to be the most accurate on several global pair-wise alignment benchmarks, with VTML200 giving best or close to the best performance in all tests. BLOSUM matrices are found to be slightly inferior, even with the marginal improvements in the bug-fixed RBLOSUM series. The PAM series is significantly worse, giving accuracies typically 2% less than VTML. Integer rounding is found to cause slight degradations in accuracy. No evidence is found that selecting a matrix based on sequence divergence improves accuracy, suggesting that the use of this heuristic in CLUSTALW may be ineffective. Using VTML200 is found to improve the accuracy of CLUSTALW by 8% on BALIBASE and 5% on PREFAB.</p> <p>Conclusion</p> <p>The hypothesis that more accurate alignments of distantly related sequences may be achieved using low-identity matrices is shown to be false for commonly used matrix types. Source code and test data is freely available from the author's web site at <url>http://www.drive5.com/pop</url>.</p

    progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss and Rearrangement

    Get PDF
    Multiple genome alignment remains a challenging problem. Effects of recombination including rearrangement, segmental duplication, gain, and loss can create a mosaic pattern of homology even among closely related organisms.We describe a new method to align two or more genomes that have undergone rearrangements due to recombination and substantial amounts of segmental gain and loss (flux). We demonstrate that the new method can accurately align regions conserved in some, but not all, of the genomes, an important case not handled by our previous work. The method uses a novel alignment objective score called a sum-of-pairs breakpoint score, which facilitates accurate detection of rearrangement breakpoints when genomes have unequal gene content. We also apply a probabilistic alignment filtering method to remove erroneous alignments of unrelated sequences, which are commonly observed in other genome alignment methods. We describe new metrics for quantifying genome alignment accuracy which measure the quality of rearrangement breakpoint predictions and indel predictions. The new genome alignment algorithm demonstrates high accuracy in situations where genomes have undergone biologically feasible amounts of genome rearrangement, segmental gain and loss. We apply the new algorithm to a set of 23 genomes from the genera Escherichia, Shigella, and Salmonella. Analysis of whole-genome multiple alignments allows us to extend the previously defined concepts of core- and pan-genomes to include not only annotated genes, but also non-coding regions with potential regulatory roles. The 23 enterobacteria have an estimated core-genome of 2.46Mbp conserved among all taxa and a pan-genome of 15.2Mbp. We document substantial population-level variability among these organisms driven by segmental gain and loss. Interestingly, much variability lies in intergenic regions, suggesting that the Enterobacteriacae may exhibit regulatory divergence.The multiple genome alignments generated by our software provide a platform for comparative genomic and population genomic studies. Free, open-source software implementing the described genome alignment approach is available from http://gel.ahabs.wisc.edu/mauve

    Meta-Alignment with Crumble and Prune: Partitioning very large alignment problems for performance and parallelization

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Continuing research into the global multiple sequence alignment problem has resulted in more sophisticated and principled alignment methods. Unfortunately these new algorithms often require large amounts of time and memory to run, making it nearly impossible to run these algorithms on large datasets. As a solution, we present two general methods, Crumble and Prune, for breaking a phylogenetic alignment problem into smaller, more tractable sub-problems. We call Crumble and Prune <it>meta-alignment </it>methods because they use existing alignment algorithms and can be used with many current alignment programs. Crumble breaks long alignment problems into shorter sub-problems. Prune divides the phylogenetic tree into a collection of smaller trees to reduce the number of sequences in each alignment problem. These methods are orthogonal: they can be applied together to provide better scaling in terms of sequence length and in sequence depth. Both methods partition the problem such that many of the sub-problems can be solved independently. The results are then combined to form a solution to the full alignment problem.</p> <p>Results</p> <p>Crumble and Prune each provide a significant performance improvement with little loss of accuracy. In some cases, a gain in accuracy was observed. Crumble and Prune were tested on real and simulated data. Furthermore, we have implemented a system called Job-tree that allows hierarchical sub-problems to be solved in parallel on a compute cluster, significantly shortening the run-time.</p> <p>Conclusions</p> <p>These methods enabled us to solve gigabase alignment problems. These methods could enable a new generation of biologically realistic alignment algorithms to be applied to real world, large scale alignment problems.</p

    Optimality regions and fluctuations for Bernoulli last passage models

    Get PDF
    We study the sequence alignment problem and its independent version,the discrete Hammersley process with an exploration penalty. We obtain rigorous upper bounds for the number of optimality regions in both models near the soft edge.At zero penalty the independent model becomes an exactly solvable model and we identify cases for which the law of the last passage time converges to a Tracy-Widom law

    Murasaki: A Fast, Parallelizable Algorithm to Find Anchors from Multiple Genomes

    Get PDF
    BACKGROUND: With the number of available genome sequences increasing rapidly, the magnitude of sequence data required for multiple-genome analyses is a challenging problem. When large-scale rearrangements break the collinearity of gene orders among genomes, genome comparison algorithms must first identify sets of short well-conserved sequences present in each genome, termed anchors. Previously, anchor identification among multiple genomes has been achieved using pairwise alignment tools like BLASTZ through progressive alignment tools like TBA, but the computational requirements for sequence comparisons of multiple genomes quickly becomes a limiting factor as the number and scale of genomes grows. METHODOLOGY/PRINCIPAL FINDINGS: Our algorithm, named Murasaki, makes it possible to identify anchors within multiple large sequences on the scale of several hundred megabases in few minutes using a single CPU. Two advanced features of Murasaki are (1) adaptive hash function generation, which enables efficient use of arbitrary mismatch patterns (spaced seeds) and therefore the comparison of multiple mammalian genomes in a practical amount of computation time, and (2) parallelizable execution that decreases the required wall-clock and CPU times. Murasaki can perform a sensitive anchoring of eight mammalian genomes (human, chimp, rhesus, orangutan, mouse, rat, dog, and cow) in 21 hours CPU time (42 minutes wall time). This is the first single-pass in-core anchoring of multiple mammalian genomes. We evaluated Murasaki by comparing it with the genome alignment programs BLASTZ and TBA. We show that Murasaki can anchor multiple genomes in near linear time, compared to the quadratic time requirements of BLASTZ and TBA, while improving overall accuracy. CONCLUSIONS/SIGNIFICANCE: Murasaki provides an open source platform to take advantage of long patterns, cluster computing, and novel hash algorithms to produce accurate anchors across multiple genomes with computational efficiency significantly greater than existing methods. Murasaki is available under GPL at http://murasaki.sourceforge.net

    Fast Statistical Alignment

    Get PDF
    We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment—previously available only with alignment programs which use computationally-expensive Markov Chain Monte Carlo approaches—yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/

    Parameters for accurate genome alignment

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Genome sequence alignments form the basis of much research. Genome alignment depends on various mundane but critical choices, such as how to mask repeats and which score parameters to use. Surprisingly, there has been no large-scale assessment of these choices using real genomic data. Moreover, rigorous procedures to control the rate of spurious alignment have not been employed.</p> <p>Results</p> <p>We have assessed 495 combinations of score parameters for alignment of animal, plant, and fungal genomes. As our gold-standard of accuracy, we used genome alignments implied by multiple alignments of proteins and of structural RNAs. We found the HOXD scoring schemes underlying alignments in the UCSC genome database to be far from optimal, and suggest better parameters. Higher values of the X-drop parameter are not always better. E-values accurately indicate the rate of spurious alignment, but only if tandem repeats are masked in a non-standard way. Finally, we show that γ-centroid (probabilistic) alignment can find highly reliable subsets of aligned bases.</p> <p>Conclusions</p> <p>These results enable more accurate genome alignment, with reliability measures for local alignments and for individual aligned bases. This study was made possible by our new software, LAST, which can align vertebrate genomes in a few hours <url>http://last.cbrc.jp/</url>.</p

    Evolutionary Modeling and Prediction of Non-Coding RNAs in Drosophila

    Get PDF
    We performed benchmarks of phylogenetic grammar-based ncRNA gene prediction, experimenting with eight different models of structural evolution and two different programs for genome alignment. We evaluated our models using alignments of twelve Drosophila genomes. We find that ncRNA prediction performance can vary greatly between different gene predictors and subfamilies of ncRNA gene. Our estimates for false positive rates are based on simulations which preserve local islands of conservation; using these simulations, we predict a higher rate of false positives than previous computational ncRNA screens have reported. Using one of the tested prediction grammars, we provide an updated set of ncRNA predictions for D. melanogaster and compare them to previously-published predictions and experimental data. Many of our predictions show correlations with protein-coding genes. We found significant depletion of intergenic predictions near the 3′ end of coding regions and furthermore depletion of predictions in the first intron of protein-coding genes. Some of our predictions are colocated with larger putative unannotated genes: for example, 17 of our predictions showing homology to the RFAM family snoR28 appear in a tandem array on the X chromosome; the 4.5 Kbp spanned by the predicted tandem array is contained within a FlyBase-annotated cDNA

    Genetic polymorphisms in DNA repair and damage response genes and late normal tissue complications of radiotherapy for breast cancer

    Get PDF
    Breast-conserving surgery followed by radiotherapy is effective in reducing recurrence; however, telangiectasia and fibrosis can occur as late skin side effects. As radiotherapy acts through producing DNA damage, we investigated whether genetic variation in DNA repair and damage response confers increased susceptibility to develop late normal skin complications. Breast cancer patients who received radiotherapy after breast-conserving surgery were examined for late complications of radiotherapy after a median follow-up time of 51 months. Polymorphisms in genes involved in DNA repair (APEX1, XRCC1, XRCC2, XRCC3, XPD) and damage response (TP53, P21) were determined. Associations between telangiectasia and genotypes were assessed among 409 patients, using multivariate logistic regression. A total of 131 patients presented with telangiectasia and 28 patients with fibrosis. Patients with variant TP53 genotypes either for the Arg72Pro or the PIN3 polymorphism were at increased risk of telangiectasia. The odds ratios (OR) were 1.66 (95% confidence interval (CI): 1.02–2.72) for 72Pro carriers and 1.95 (95% CI: 1.13–3.35) for PIN3 A2 allele carriers compared with non-carriers. The TP53 haplotype containing both variant alleles was associated with almost a two-fold increase in risk (OR 1.97, 95% CI: 1.11–3.52) for telangiectasia. Variants in the TP53 gene may therefore modify the risk of late skin toxicity after radiotherapy
    corecore